::p_load(rgdal, sf, spdep, tmap, tidyverse, ClustGeo, ggpubr, cluster, factoextra, NbClust, heatmaply, corrplot, psych, GGally, funModeling) pacman
Take-Home Exercise 2 - Geospatial Analytics for Social Good
1. Overview
1.1. Background
The process of creating regions is called regionalisation. A regionalisation is a special kind of clustering where the objective is to group observations which are similar in their statistical attributes, but also in their spatial location. In this sense, regionalization embeds the same logic as standard clustering techniques, but also applies a series of geographical constraints. Often, these constraints relate to connectivity: two candidates can only be grouped together in the same region if there exists a path from one member to another member that never leaves the region. These paths often model the spatial relationships in the data, such as contiguity or proximity. However, connectivity does not always need to hold for all regions, and in certain contexts it makes sense to relax connectivity or to impose different types of geographic constraints.
1.2. Overall Goal
This study aims to apply appropriate clustering techniques to reveal the spatial patterns of water points in Nigeria.
1.3. Key Objectives
Using the appropriate R packages, we will need to:
Prepare the dataset and save it in simple feature data frameformat, as well as derive the proportion of functional and non-functional water point at LGA level
Conduct thematic mapping analysis to examine the spatial distribution of water points at LGA level
Conduct hotspot areas and outliers/clusters maps of water points at LGA level
2. Setup
2.1 Packages Used
The R packages that we will be using for this analysis area:
sf, rgdal: used for importing, managing, and processing geospatial data
spdep: used for computing spatial weights, global and local spatial auto-correlation statistics
tidyverse: used for wrangling attribute data
tmap: used for creating cartographic quality choropleth map
coorplot, ggpubr, heatmaply: used for multivariate data visualization and analysis
cluster, ClustGeo: used for cluster analysis
funModeling: used for exploratory data analysis, data preparation and model performance
In addition, the following tidyverse packages will be used:
readr for reading rectangular data from csv, tsv and fwf
tidyr for manipulating and tidying data
dplyr for wrangling and transforming data
ggplot2 for visualising data
2.2 Datasets Used
2 geospatial datasets will be utilized for this study:
geo_export_338e5689-bd72-4866-bfde-8997933e9897
WPdx+ dataset from WPdx Global Data Repositories
nga_admbnda_adm2_osgof_20190417
Nigeria Level-2 Administrative Boundary (also known as Local Government Area) polygon features GIS data from geoBoundaries
2.3 Launching the packages in R
The code chunk below is used to perform the following tasks:
creating a package list containing the necessary R packages,
checking if the R packages in the package list have been installed in R,
- if they have yet to be installed, RStudio will installed the missing packages,
launching the packages into R environment.
3. Data Preparation
In this section, we will bring geospatial data into R environment. The geospatial data is in ESRI shapefile format.
3.1 Import water point shapefile into R environment
The code chunk below uses st_read() of sf package to import Nigeria shapefile into R. The imported shapefile will be simple features Object of sf.
<- st_read(dsn = "geodata",
wp2 layer = "geo_export_338e5689-bd72-4866-bfde-8997933e9897",
crs = 4326) %>%
filter(clean_coun == "Nigeria") %>%
select(1, 3:4, 11:15, 23, 36:40, 43:50)
The code chunk below uses write_rds() of readr package to save the extracted sf data table (i.e. wp) into an output in rds data format. The output file is called wp_nga2.rds and it is saved in geodata sub-folder.
<- write_rds(wp2, "geodata/wp_nga2.rds") wp_nga2
3.2 Import Nigeria LGA boundary data into R environment
We are going to import LGA boundary data into R environment using the following code chunk, st_read() of sf package. It is used to import nga_admbnda_adm2_osgof_20190417 shapefile and save the imported geospatial data into simple feature data table.
<- st_read(dsn = "geodata",
nga2 layer = "nga_admbnda_adm2_osgof_20190417",
crs = 4326)
3.3 Recoding NA values into string
Use replace_na() to recode all the NA values in status_cle field into the Unknown.
<- read_rds("geodata/wp_nga2.rds") %>%
wp_nga2 mutate(status_cle = replace_na(status_cle, "Unknown"))
3.4 Checking of duplicated area name
We will order our dataframe by alphabetical order based on ADM2_REF and use the duplicated function to retrieve all ADM2_REF that has duplicates and store it in a list.
<- (nga2[order(nga2$ADM2_REF), ])
nga2
<- nga2$ADM2_REF [nga2$ADM2_REF %in%
duplicate_area $ADM2_REF[duplicated(nga2$ADM2_REF)]]
nga2
duplicate_area
From the results, we identified 12 ADM2_REF that are duplicates. We will then leverage on interactive view of tmap to check the location of each area. With the help of Google Map, we will retrieve the actual name and state of these areas.
Index | Actual Area Name |
---|---|
94 | Bassa (Kogi) |
95 | Bassa (Plateau) |
304 | Ifelodun (Kwara) |
305 | Ifelodun (Osun) |
355 | Irepodun (Kawara) |
356 | Irepodun (Osun) |
519 | Nasarawa (Kano) |
520 | Nasarawa West |
546 | Obi (Benue) |
547 | Obi (Nasarawa) |
693 | Surulere (Lagos) |
694 | Surulere (Oyo) |
tmap_mode("view")
tm_shape(nga2[nga2$ADM2_REF %in% duplicate_area,]) +
tm_polygons()
tmap_mode("plot")
We will now access the individual index of the nga2
data frame and change the value. Lastly, we use the length()
function to ensure there is no more duplicated ADM2_REF.
$ADM2_REF[c(94,95,304,305,355,356,519,546,547,693,694)] <- c("Bassa (Kogi)","Bassa (Plateau)", "Ifelodun (Kwara)","Ifelodun (Osun)", "Irepodun (Kwara)","Irepodun (Osun)", "Nasarawa (Kano)","Obi (Benue)","Obi(Nasarawa)", "Surulere (Lagos)","Surulere (Oyo)")
nga2
length((nga2$ADM2_REF[ nga2$ADM2_REF %in% nga2$ADM2_REF[duplicated(nga2$ADM2_REF)] ]))
3.5 Perform data binning for usage capacity field
Before we perform data binning, we will review the summary statistics of usage capacity field using the code chunk below. Based on the We will then use cut() function to categorize the values under usage_cap field to <1000 and >=1000.
summary(wp_nga2$usage_cap)
<- wp_nga2 %>%
wp_nga2 mutate(usage_cap_bin = cut(usage_cap, breaks = c(0, 999, Inf), labels = c("<1000", ">=1000")))
4. Exploratory Data Analysis via Statistical Graphics
Use freq() of funModeling package to display the distribution of status_cle, X_water_tec, usage_cap_bin, is_urban field in wp_nga2.
freq(data = wp_nga2,
input = "status_cle")
freq(data = wp_nga2,
input = "X_water_tec")
freq(data = wp_nga2,
input = "usage_cap_bin")
freq(data = wp_nga2,
input = "is_urban")
4.1 Extract functional water point data
In the code chunk below, filter() of dplyr is used to select functional water points.
<- wp_nga2 %>%
wpt_functional filter(status_cle %in%
c("Functional",
"Functional but not in use",
"Functional but needs repair"))
freq(data=wpt_functional,
input = 'status_cle')
4.2 Extract non-functional water point data
In the code chunk below, filter() of dplyr is used to select non-functional water points.
<- wp_nga2 %>%
wpt_nonfunctional filter(status_cle %in%
c("Abandoned/Decommissioned",
"Abandoned",
"Non-Functional",
"Non functional due to dry season",
"Non-Functional due to dry season"))
freq(data=wpt_nonfunctional,
input = 'status_cle')
4.3 Extract water point data with Unknown class
In the code chunk below, filter() of dplyr is used to select unknown water points.
<- wp_nga2 %>%
wpt_unknown filter(status_cle == "Unknown")
4.4 Extract water point data with Hand Pump technology
In the code chunk below, filter() of dplyr is used to select hand pump water points.
<- wp_nga2 %>%
wpt_handpump filter(X_water_tec == "Hand Pump")
4.5 Extract LGA usage capacity
In the code chunk below, filter() of dplyr is used to select LGA usage capacities for <1000 and >=1000 respectively.
<- wp_nga2 %>%
lga_usage_cap_below1000 filter(usage_cap_bin == "<1000")
<- wp_nga2 %>%
lga_usage_cap_atleast1000 filter(usage_cap_bin == ">=1000")
4.6 Extract rural water points
In the code chunk below, filter() of dplyr is used to select rural water points.
<- wp_nga2 %>%
wpt_rural filter(is_urban == "False")
4.7 Performing Point-in-Polygon Count
The code chunk below performs 2 operations at one go. Firstly, it uses st_intersects() to identify the various water point types (e.g. total, functional, non-functional, hand pump), usage capacity and rural water points located inside each LGA boundary. Next, length() of Base R is used to calculate the number of water points that fall within each LGA boundary.
<- nga2 %>%
nga_wp2 mutate(`total wpt` = lengths(
st_intersects(nga2, wp_nga2))) %>%
mutate(`wpt functional` = lengths(
st_intersects(nga2, wpt_functional))) %>%
mutate(`wpt non-functional` = lengths(
st_intersects(nga2, wpt_nonfunctional))) %>%
mutate(`wpt unknown` = lengths(
st_intersects(nga2, wpt_unknown))) %>%
mutate(`wpt handpump` = lengths(
st_intersects(nga2, wpt_handpump))) %>%
mutate(`lga usage cap b1000` = lengths(
st_intersects(nga2, lga_usage_cap_below1000))) %>%
mutate(`lga usage cap a1000` = lengths(
st_intersects(nga2, lga_usage_cap_atleast1000))) %>%
mutate(`wpt rural` = lengths(
st_intersects(nga2, wpt_rural)))
4.8 Saving the Analytical Data Table
The code chunk below uses mutate() of dplyr package to derive 6 fields namely pct_functional, pct_non-functional, pct_handpump, pct_usagecap_below1000, pct_usagecap_atleast1000 and pct_rural. In order to keep the file size small, select() of dplyr is used to retain on the relevant fields.
<- nga_wp2 %>%
nga_wp2 mutate(pct_functional = `wpt functional`/`total wpt`) %>%
mutate(`pct_non-functional` = `wpt non-functional`/`total wpt`) %>%
mutate(`pct_handpump` = `wpt handpump`/`total wpt`) %>%
mutate(`pct_usagecap_below1000` = `lga usage cap b1000`/`total wpt`) %>%
mutate(`pct_usagecap_atleast1000` = `lga usage cap a1000`/`total wpt`) %>%
mutate(`pct_rural` = `wpt rural`/`total wpt`)
Thereafter, we will save the sf data table in rds format for subsequent analysis.
write_rds(nga_wp2, "geodata/nga_wp2.rds")
5. Exploratory Spatial Data Analysis via Choropleth Map
5.1 Preparing a choropleth map - Functional Water Points
To take a quick look at the distribution of the water points in LGA of Nigeria, a choropleth map will be prepared. The code chunks below are used to prepare the choropleth by using qtm() function of tmap package.
qtm(nga_wp2, "pct_functional")
In order to assess the level of bias in the dataset for functional water points, we will create two choropleth maps, one for the total functional water points and one for the percentage of functional water points by using the code chunk below.
tm_shape(nga_wp2) +
tm_polygons(c("wpt functional", "pct_functional"),
style = "jenks") +
tm_facets(sync = TRUE, ncol = 2) +
tm_legend(legend.position = c("right", "bottom")) +
tm_layout(outer.margins=0, asp=0)
Based on the results above, it shows that there is a higher concentration of functional water points in the northern part of Nigeria.
5.2 Preparing a choropleth map - Non-Functional Water Points
Similar to Point 5.1, we will use similar code chunks to prepare choropleth map for non-functional water points.
qtm(nga_wp2, "pct_non-functional")
In order to assess the level of bias in the dataset for non-functional water points, we will create two choropleth maps, one for the total non-functional water points and one for the percentage of non-functional water points by using the code chunk below.
tm_shape(nga_wp2) +
tm_polygons(c("wpt non-functional", "pct_non-functional"),
style = "jenks") +
tm_facets(sync = TRUE, ncol = 2) +
tm_legend(legend.position = c("right", "bottom")) +
tm_layout(outer.margins=0, asp=0)
Based on the results above, it shows that there is a higher concentration of non-functional water points in the southern part of Nigeria.
5.3 Preparing a choropleth map - Handpump Water Points
Similar to Point 5.1, we will use similar code chunks to prepare choropleth map for handpump water points.
qtm(nga_wp2, "pct_handpump")
tm_shape(nga_wp2) +
tm_polygons(c("wpt handpump", "pct_handpump"),
style = "jenks") +
tm_facets(sync = TRUE, ncol = 2) +
tm_legend(legend.position = c("right", "bottom")) +
tm_layout(outer.margins=0, asp=0)
Based on the results above, it shows that there is a higher concentration of handpump water points in the northern part of Nigeria i.e. regions with lower GDP.
5.4 Preparing a choropleth map - Usage Capacity Below 1000
Similar to Point 5.1, we will use similar code chunks to prepare choropleth map for water points with usage capacity below 1000.
qtm(nga_wp2, "pct_usagecap_below1000")
tm_shape(nga_wp2) +
tm_polygons(c("lga usage cap b1000", "pct_usagecap_below1000"),
style = "jenks") +
tm_facets(sync = TRUE, ncol = 2) +
tm_legend(legend.position = c("right", "bottom")) +
tm_layout(outer.margins=0, asp=0)
Based on the results above, water points with usage capacity below 1000 are quite prevalent across Nigeria especially in the northern and south-eastern part of Nigeria.
5.5 Preparing a choropleth map - Usage Capacity At Least 1000
Similar to Point 5.1, we will use similar code chunks to prepare choropleth map for water points with usage capacity at least 1000.
qtm(nga_wp2, "pct_usagecap_atleast1000")
tm_shape(nga_wp2) +
tm_polygons(c("lga usage cap a1000", "pct_usagecap_atleast1000"),
style = "jenks") +
tm_facets(sync = TRUE, ncol = 2) +
tm_legend(legend.position = c("right", "bottom")) +
tm_layout(outer.margins=0, asp=0)
Based on the results above, water points with usage capacity at least 1000 are quite prevalent in the southern part of Nigeria.
5.6 Preparing a choropleth map - Rural Water Points
Similar to Point 5.1, we will use similar code chunks to prepare choropleth map for rural water points.
qtm(nga_wp2, "pct_rural")
tm_shape(nga_wp2) +
tm_polygons(c("wpt rural", "pct_rural"),
style = "jenks") +
tm_facets(sync = TRUE, ncol = 2) +
tm_legend(legend.position = c("right", "bottom")) +
tm_layout(outer.margins=0, asp=0)
Based on the results above, rural water points are quite prevalent across of Nigeria.
6. Correlation Analysis
Based on choropleth maps in Section 5, there are missing values. We will replace all the NaN values with Null values.
$pct_functional[is.nan(nga_wp2$pct_functional)] <-0
nga_wp2$`pct_non-functional`[is.nan(nga_wp2$`pct_non-functional`)] <-0
nga_wp2$pct_handpump[is.nan(nga_wp2$pct_handpump)] <-0
nga_wp2$pct_usagecap_below1000[is.nan(nga_wp2$pct_usagecap_below1000)] <-0
nga_wp2$pct_usagecap_atleast1000[is.nan(nga_wp2$pct_usagecap_atleast1000)] <-0
nga_wp2$pct_rural[is.nan(nga_wp2$pct_rural)] <-0 nga_wp2
Before we perform cluster analysis, it is important for us to ensure that the cluster variables are not highly correlated. We will use corrplot.mixed() function of corrplot package to visualise and analyse the correlation of the input variables.
= cor(st_set_geometry(nga_wp2[,26:31], NULL))
cluster_vars.cor corrplot.mixed(cluster_vars.cor,
lower = "ellipse",
upper = "number",
tl.pos = "lt",
diag = "l",
tl.col = "black")
The correlation plot above shows that pct_usagecap_below1000 and pct_usagecap_atleast1000 are highly correlated (>0.85). This suggest that only one of them should be used in the cluster analysis instead of both.
7. Hierarchy Cluster Analysis
We will perform hierarchical cluster analysis on the dataset.
7.1 Extracting clustering variables
The code chunk below will be used to extract the clustering variables from the nga_wp2 simple feature object into data.frame.
<- nga_wp2 %>%
cluster_vars st_set_geometry(NULL) %>%
select("ADM2_REF", "pct_functional", "pct_non-functional", "pct_handpump", "pct_usagecap_below1000", "pct_rural")
head(cluster_vars,10)
We need to change the rows by township name instead of row number. Thereafter, we will delete the ADM2_REF field by using the code chunk below.
row.names(cluster_vars) <- cluster_vars$"ADM2_REF"
head(cluster_vars,10)
<- select(cluster_vars, c(2:6))
nga_wp2_hc head(nga_wp2_hc, 10)
7.2 Visualizing raw values without data standardization
We will now visualize the various raw values using ggplot() function. Based on the histogram, it seems like there is a need to standardize the values.
<- ggplot(data = nga_wp2_hc,
f aes(x= `pct_functional`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
ggtitle("Raw values without standardisation - Functional")
<- ggplot(data = nga_wp2_hc,
nf aes(x= `pct_non-functional`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
ggtitle("Raw values without standardisation - Non-Functional")
<- ggplot(data = nga_wp2_hc,
h aes(x= `pct_handpump`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
ggtitle("Raw values without standardisation - Handpump")
<- ggplot(data = nga_wp2_hc,
ucap aes(x= `pct_usagecap_below1000`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
ggtitle("Raw values without standardisation - Usage Cap Below 1000")
<- ggplot(data = nga_wp2_hc,
r aes(x= `pct_rural`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
ggtitle("Raw values without standardisation - Rural")
ggarrange(f, nf, h, ucap, r,
ncol = 3,
nrow = 2)
7.3 Z-score standardization
The code chunk below will be used to standardise the clustering variables by using Z-score method.
<- scale(nga_wp2_hc)
nga_wp2_hc.z describe(nga_wp2_hc.z)
7.4 Visualizing standardized clustering variables
The code chunk below plot the 5 scaled fields.
<- as.data.frame(nga_wp2_hc.z)
nga_wp2_hc_z_df <- ggplot(data=nga_wp2_hc_z_df,
fz aes(x=`pct_functional`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
ggtitle("Z-score Standardisation - Functional")
<- as.data.frame(nga_wp2_hc.z)
nga_wp2_hc_z_df <- ggplot(data=nga_wp2_hc_z_df,
nfz aes(x=`pct_non-functional`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
ggtitle("Z-score Standardisation - Non-Functional")
<- as.data.frame(nga_wp2_hc.z)
nga_wp2_hc_z_df <- ggplot(data=nga_wp2_hc_z_df,
hz aes(x=`pct_handpump`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
ggtitle("Z-score Standardisation - Handpump")
<- as.data.frame(nga_wp2_hc.z)
nga_wp2_hc_z_df <- ggplot(data=nga_wp2_hc_z_df,
ucapz aes(x=`pct_usagecap_below1000`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
ggtitle("Z-score Standardisation - Usage Cap Below 1000")
<- as.data.frame(nga_wp2_hc.z)
nga_wp2_hc_z_df <- ggplot(data=nga_wp2_hc_z_df,
rz aes(x=`pct_rural`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
ggtitle("Z-score Standardisation - Rural")
ggarrange(fz, nfz, hz, ucapz, rz,
ncol = 3,
nrow = 2)
7.5 Computing proximity matrix
We will compute the proximity matrix by using dist() of R. The code chunk below is used to compute the proximity matrix using euclidean method.
<- dist(nga_wp2_hc_z_df, method = 'euclidean')
proxmat
proxmat
7.6 Computing hierarchical clustering
We will use hclust() of R stats to compute hierarchical clustering.The code chunk below performs hierarchical cluster analysis using ward.D method. The hierarchical clustering output is stored in an object of class hclust which describes the tree produced by the clustering process.
<- hclust(proxmat, method = 'ward.D') hclust_ward
We can then plot the tree by using plot() of R Graphics as shown in the code chunk below.
plot(hclust_ward, cex = 0.1)
7.7 Selecting the optimal clustering algorithm
The code chunk below will be used to compute the agglomerative coefficients of all hierarchical clustering algorithms (values closer to 1 suggest strong clustering structure).
<- c( "average", "single", "complete", "ward")
m names(m) <- c( "average", "single", "complete", "ward")
<- function(x) {
ac agnes(nga_wp2_hc_z_df, method = x)$ac
}
map_dbl(m, ac)
With reference to the output above, we can see that Ward’s method provides the strongest clustering structure among the four methods assessed. Hence, in the subsequent analysis, only Ward’s method will be used.
7.8 Determining Optimal Clusters
We will use gap statistic method to help determine the optimal cluster. To compute the gap statistic, clusGap() of cluster package will be used.
set.seed(12345)
<- clusGap(nga_wp2_hc_z_df,
gap_stat FUN = hcut,
nstart = 25,
K.max = 20,
B = 50)
# Print the result
print(gap_stat, method = "firstmax")
Also note that the hcut function used is from factoextra package.
Next, we can visualise the plot by using fviz_gap_stat() of factoextra package.
fviz_gap_stat(gap_stat)
By examine the gap statistic graph, the 18-cluster gives the largest gap statistic and should be the next best cluster to pick.
7.9 Visually-driven hierarchical clustering analysis
We will apply heatmaply package to build both highly interactive cluster heatmap or static cluster heatmap.
7.9.1 Transforming the data frame into a matrix
The data was loaded into a data frame, but it has to be a data matrix to make a heatmap.
The code chunk below will be used to transform nga_wp2_hc_z_df data frame into a data matrix.
<- data.matrix(nga_wp2_hc_z_df) nga_wp2_hc_z_df_mat
7.9.2 Plotting interactive cluster heatmap using heatmaply()
In the code chunk below, the heatmaply() of heatmaply package is used to build an interactive cluster heatmap.
heatmaply(normalize(nga_wp2_hc_z_df_mat),
Colv=NA,
dist_method = "euclidean",
hclust_method = "ward.D",
seriate = "OLO",
colors = Blues,
k_row = 18,
margins = c(NA,200,60,NA),
fontsize_row = 4,
fontsize_col = 5,
main="Geographic Segmentation of Nigeria State by different indicators",
xlab = "Different Indicators",
ylab = "Townships of Nigeria State"
)
Based on the results above, it show that the many regions in Nigeria has a high correlation with pct_rural variables compared to pct_non-functional variables.
7.10 Mapping the clusters formed
With closed examination of the dendrogram above, we have decided to retain 18 clusters.
cutree() of R Base will be used in the code chunk below to derive a 18-cluster model.
<- as.factor(cutree(hclust_ward, k=18)) groups
The code chunk below form the join in three steps:
the groups list object will be converted into a matrix;
cbind() is used to append groups matrix onto nga_wp2 to produce an output simple feature object called
nga_wp2_cluster
; andrename of dplyr package is used to rename as.matrix.groups field as CLUSTER.
<- cbind(nga_wp2, as.matrix(groups)) %>%
nga_wp2_cluster rename(`CLUSTER`=`as.matrix.groups.`)
Next, qtm() of tmap package is used to plot the choropleth map showing the cluster formed.
qtm(nga_wp2_cluster, "CLUSTER")
The choropleth map above reveals the clusters are very fragmented, especially in the central and southern part of Nigeria. There seems to be a more homogeneous distribution of similar clusters in the northern part of Nigeria.
8. Spatially Constrained Clustering: ClustGeo Method
Using the ClustGeo package to perform non-spatially constrained hierarchical cluster analysis and spatially constrained cluster analysis.
8.1 Ward-like hierarchical clustering: ClustGeo
To perform non-spatially constrained hierarchical clustering, we only need to provide the function a dissimilarity matrix as shown in the code chunk below.
<- hclustgeo(proxmat)
nongeo_cluster plot(nongeo_cluster, cex = 0.5)
rect.hclust(nongeo_cluster,
k = 18,
border = 2:5)
8.2 Mapping the clusters formed
<- as.factor(cutree(nongeo_cluster, k=18))
groups
<- cbind(nga_wp2, as.matrix(groups)) %>%
nga_wp2_ngeo_cluster rename(`CLUSTER` = `as.matrix.groups.`)
qtm(nga_wp2_ngeo_cluster, "CLUSTER")
8.3 Spatially Constrained Hierarchical Clustering
Before we can performed spatially constrained hierarchical clustering, a spatial distance matrix will be derived by using st_distance()
of sf package.
<- st_distance(nga_wp2, nga_wp2)
dist <- as.dist(dist) distmat
Notice that as.dist()
is used to convert the data frame into matrix.
Next, choicealpha()
will be used to determine a suitable value for the mixing parameter alpha as shown in the code chunk below.
<- choicealpha(proxmat, distmat, range.alpha = seq(0, 1, 0.1), K=18, graph = TRUE) cr
With reference to the graphs above, alpha = 0.35 will be used as shown in the code chunk below.
<- hclustgeo(proxmat, distmat, alpha = 0.35) clustG
Next, cutree()
is used to derive the cluster object.
<- as.factor(cutree(clustG, k=18)) groups
We will then join back the group list with nga_wp2 polygon feature data frame by using the code chunk below.
<- cbind(nga_wp2, as.matrix(groups)) %>%
nga_wp2_Gcluster rename(`CLUSTER` = `as.matrix.groups.`)
We can now plot the map of the newly delineated spatially constrained clusters. Based on the figure below, it shows that the clusters are less fragmented. Furthermore, Cluster 10 appears to be more prevalent (cover wider geographical area) compared to the rest of the clusters.
qtm(nga_wp2_Gcluster, "CLUSTER")
8.4 Multivariate Visualization
For the code chunk below, we will use ggparcoord()
of GGally package to create parallel coordinate plots as part of revealing clustering variables by clusters.
ggparcoord(data = nga_wp2_Gcluster,
columns = c(25:28,30),
scale = "globalminmax",
alphaLines = 0.2,
boxplot = TRUE,
title = "Multiple Parallel Coordinates Plots of 5 Different Variables by Cluster") +
facet_grid(~ CLUSTER) +
theme(axis.text.x = element_text(angle = 90))
Based on the results above, it shows that Cluster 3, 4 and 10 tend to have a higher proportion of functional water points. However Cluster 3, 4, 10 and 14 exhibit lower proportion of usage capacity below 1000.
In the code chunk below, group_by()
and summarise()
of dplyr are used to derive mean values of the clustering variables.
%>%
nga_wp2_Gcluster st_set_geometry(NULL) %>%
group_by(CLUSTER) %>%
summarise(mean_pct_functional = mean(pct_functional),
mean_pct_non.functional = mean(pct_non.functional),
mean_pct_handpump = mean(pct_handpump),
mean_pct_usagecap_below1000 = mean(pct_usagecap_below1000),
mean_pct_rural = mean(pct_rural))
9. Conclusion
In summary, spatially constrained clustering is useful in segmenting geographical populations for better allocation of resources. The use of hierarchical clustering via agglomerative approach helps to group dissimilar groups together with no apriori information about the number of clusters required.
10. Future Work
To further enhance the quality of data analysis, I proposed the following areas:
- Socio-demographics - Instead of looking at the proportion of water points and technology of water points, we can attempt to understand socio-demographics of the population e.g. affluence, GDP per capita, population size etc. Given that Nigeria is a relatively large country with heterogeneous population and development levels, we can expect that each region has its separate set of needs and thus consumption behaviour of water points and type of technology used to accommodate the varying consumption levels.
- k-nearest neighbour - I would also propose to have more samples for different neighbours to compare the results as this will help strengthen the justification of high priority regions that requires attention.